ground-level image
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Massachusetts (0.04)
- Europe > Sweden (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- (2 more...)
Scaling Image Geo-Localization to Continent Level
Lindenberger, Philipp, Sarlin, Paul-Edouard, Hosang, Jan, Balice, Matteo, Pollefeys, Marc, Lynen, Simon, Trulls, Eduard
Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been studied primarily on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach localizes more than 68% of queries within 200 m on a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
- Europe > Western Europe (0.04)
- Europe > Belgium (0.04)
- North America > United States > Massachusetts (0.04)
- (12 more...)
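The hybrid idea above (learn per-geocell prototypes via a proxy classification task, fuse them with aerial embeddings, then retrieve directly) can be sketched minimally. This is an illustrative reconstruction, not the paper's code: the prototype and embedding values are random stand-ins for learned models, and all names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each geocell (a fine-grained location bin) has a
# prototype vector learned through the proxy classification task; an aerial
# embedding per cell is fused in to compensate for sparse ground-level data.
n_cells, dim = 1000, 64
ground_prototypes = rng.normal(size=(n_cells, dim))
aerial_embeddings = rng.normal(size=(n_cells, dim))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Fuse both sources per cell, then normalize for cosine retrieval.
prototypes = l2_normalize(ground_prototypes + aerial_embeddings)

def localize(query_embedding):
    """Return the geocell whose fused prototype best matches the query."""
    sims = prototypes @ l2_normalize(query_embedding)
    return int(np.argmax(sims))

# A query embedded near cell 42's prototype retrieves cell 42.
query = prototypes[42] + 0.01 * rng.normal(size=dim)
assert localize(query) == 42
```

Retrieval over prototypes rather than over all (>100M) database images is what makes the fine-grained search tractable at continent scale.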
TaxaBind: A Unified Embedding Space for Ecological Applications
Sastry, Srikumar, Khanal, Subash, Dhakal, Aayush, Ahmad, Adeel, Jacobs, Nathan
We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite imagery, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-modal retrieval, and audio classification. The datasets and models are made available at https://github.com/mvrl/TaxaBind.
- Africa > Kenya (0.04)
- North America > United States > Illinois (0.04)
- North America > United States > Hawaii (0.04)
- (3 more...)
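The binding-modality training above rests on contrastive alignment: each modality is pulled toward its paired species-image embedding with an InfoNCE-style loss. A minimal numpy sketch, with random vectors standing in for encoder outputs and the multimodal patching details omitted (all names here are illustrative assumptions, not TaxaBind's API):

```python
import numpy as np

rng = np.random.default_rng(1)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def infonce(binding, other, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss aligning one modality's batch to
    the binding modality; matched pairs sit on the diagonal of the logits."""
    logits = l2n(binding) @ l2n(other).T / temperature
    labels = np.arange(len(binding))
    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

# Species-image embeddings act as the binding anchor; audio, satellite, text,
# location, and environmental embeddings are each aligned to them in turn.
species = rng.normal(size=(8, 32))
audio_aligned = species + 0.05 * rng.normal(size=(8, 32))  # nearly matched
audio_random = rng.normal(size=(8, 32))                    # unrelated

assert infonce(species, audio_aligned) < infonce(species, audio_random)
```

Because every modality is aligned to the same anchor, modalities that were never paired directly (e.g., audio and satellite imagery) still become comparable, which is the source of the zero-shot and emergent capabilities reported.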
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization
Sarkar, Anindya, Sastry, Srikumar, Pirinen, Aleksis, Zhang, Chongjie, Jacobs, Nathan, Vorobeychik, Yevgeniy
We consider the task of active geo-localization (AGL), in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities. This could emulate a UAV involved in a search-and-rescue operation navigating through an area, observing a stream of aerial images as it goes. The AGL task poses two important challenges. First, an agent must deal with a goal specification in one of multiple modalities (e.g., a natural language description) while the search cues are provided in other modalities (aerial imagery). Second, localization time is limited (e.g., by battery life or urgency), so the goal must be localized as efficiently as possible, i.e., the agent must effectively leverage its sequentially observed aerial views when searching for the goal. To address these challenges, we propose GOMAA-Geo, a goal-modality-agnostic active geo-localization agent, for zero-shot generalization between different goal modalities. Our approach combines cross-modality contrastive learning to align representations across modalities with supervised foundation model pretraining and reinforcement learning to obtain highly effective navigation and localization policies. Through extensive evaluations, we show that GOMAA-Geo outperforms alternative learnable approaches and that it generalizes across datasets (e.g., to disaster-hit areas without seeing a single disaster scenario during training) and goal modalities (e.g., to ground-level imagery or textual descriptions, despite only being trained with goals specified as aerial views). Code and models will be made publicly available at this link.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Massachusetts (0.04)
- Europe > Sweden (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
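The sequential search problem above can be reduced to a toy for intuition: an agent on a grid of aerial views must reach a goal cell in as few steps as possible. In GOMAA-Geo the goal may arrive in any modality, but contrastive pretraining maps all goals into one space, so the policy only ever sees a goal embedding. In this heavily simplified sketch a distance-based score stands in for embedding similarity, and a greedy rule stands in for the learned RL policy; nothing here reflects the paper's actual architecture.

```python
# Toy active geo-localization: greedy ascent on a similarity-to-goal score.
GRID, GOAL = 8, (6, 3)

def score(pos):
    # Stand-in for goal-embedding similarity: higher when closer to the goal.
    return -(abs(pos[0] - GOAL[0]) + abs(pos[1] - GOAL[1]))

def run_episode(start, max_steps=32):
    pos, steps = start, 0
    while pos != GOAL and steps < max_steps:
        neighbors = [
            (pos[0] + dr, pos[1] + dc)
            for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))
            if 0 <= pos[0] + dr < GRID and 0 <= pos[1] + dc < GRID
        ]
        pos = max(neighbors, key=score)  # greedy move toward the goal
        steps += 1
    return pos, steps

final, steps = run_episode((0, 0))
assert final == GOAL and steps == 9  # Manhattan distance from (0,0) to (6,3)
```

The real agent has no such oracle score; it must estimate goal proximity from its sequence of aerial observations, which is why reinforcement learning over the aligned embedding space is needed.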
Bird's-Eye View to Street-View: A Survey
Bajbaa, Khawlah, Usman, Muhammad, Anwar, Saeed, Radwan, Ibrahim, Bais, Abdul
In recent years, street view imagery has grown to become one of the most important sources of geospatial data collection and urban analytics, facilitating meaningful insights and assisting in decision-making. Synthesizing a street-view image from its corresponding satellite image is a challenging task due to the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of the state of the art in synthesizing street-view images from their corresponding satellite counterparts. The main findings are: (i) novel deep learning techniques are required for synthesizing more realistic and accurate street-view images; (ii) more datasets need to be collected for public usage; and (iii) more specific evaluation metrics need to be investigated for evaluating the generated images appropriately. We conclude that, because it applies outdated deep learning techniques, the recent literature fails to generate detailed and diverse street-view images.
- Asia > Middle East > Saudi Arabia > Eastern Province > Dhahran (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- (5 more...)
GEOBIND: Binding Text, Image, and Audio through Satellite Images
Dhakal, Aayush, Khanal, Subash, Sastry, Srikumar, Ahmad, Adeel, Jacobs, Nathan
In remote sensing, we are interested in modeling various modalities for a given geographic location. Several works have focused on learning the relationship between a location and its type of landscape, habitability, audio, textual descriptions, etc. Recently, a common way to approach these problems is to train a deep-learning model that uses satellite images to infer some unique characteristics of the location. In this work, we present a deep-learning model, GeoBind, that can infer multiple modalities, specifically text, image, and audio, from satellite imagery of a location. To do this, we use satellite images as the binding element and contrastively align all other modalities to the satellite image data. Our training results in a joint embedding space with multiple types of data: satellite image, ground-level image, audio, and text. Furthermore, our approach does not require a single complex dataset that contains all the modalities mentioned above. Rather, it requires only multiple datasets in which each modality is paired with satellite images. While we only align three modalities in this paper, we present a general framework that can be used to create an embedding space with any number of modalities by using satellite images as the binding element. Our results show that, unlike traditional unimodal models, GeoBind is versatile and can reason about multiple modalities for a given satellite image input.
- North America > United States > Missouri > St. Louis County > St. Louis (0.05)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > China (0.04)
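The key property of the binding setup above is emergent cross-modal retrieval: text and audio are never paired directly, yet each is aligned to the satellite embedding of its location, so they land near the same anchor and become mutually searchable. A minimal sketch under strong simplifying assumptions (alignment is idealized as a small perturbation of the anchor rather than actual contrastive training; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

sat = rng.normal(size=(16, 32))                  # satellite anchors, one per location
text = sat + 0.05 * rng.normal(size=(16, 32))    # aligned via (satellite, text) pairs
audio = sat + 0.05 * rng.normal(size=(16, 32))   # aligned via (satellite, audio) pairs

def nn(query, bank):
    """Index of the nearest row of `bank` by cosine similarity."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return int(np.argmax(b @ q))

# Emergent zero-shot retrieval: a text embedding retrieves the audio recorded
# at the same location, although the two modalities were never trained together.
assert all(nn(text[i], audio) == i for i in range(16))
```

This is why only satellite-paired datasets are needed: the satellite image acts as a hub through which every other modality becomes comparable.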
Cross-Modal Learning of Housing Quality in Amsterdam
Levering, Alex, Marcos, Diego, Tuia, Devis
In our research we test data and models for the recognition of housing quality in the city of Amsterdam from ground-level and aerial imagery. For ground-level images we compare Google StreetView (GSV) to Flickr images. Our results show that GSV predicts the most accurate building quality scores, approximately 30% better than using only aerial images. However, we find that through careful filtering and by using the right pre-trained model, Flickr image features combined with aerial image features are able to halve the performance gap to GSV features from 30% to 15%. Our results indicate that there are viable alternatives to GSV for liveability factor prediction, which is encouraging as GSV images are more difficult to acquire and not always available.
- Europe > Netherlands > North Holland > Amsterdam (0.63)
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- North America > United States > New York (0.04)
- (2 more...)
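The fusion experiment above (ground-level features combined with aerial features to predict a building-quality score) amounts to concatenating the two feature vectors and fitting a regressor. A hedged numpy sketch with synthetic features and scores; in the study the features come from pre-trained CNNs on GSV/Flickr and aerial imagery, and the targets from city quality surveys, none of which are reproduced here:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-ins: per-building ground and aerial feature vectors, and a
# quality score that is (by construction) a noisy linear function of both.
n = 200
ground = rng.normal(size=(n, 8))
aerial = rng.normal(size=(n, 8))
w_true = rng.normal(size=16)
X = np.concatenate([ground, aerial], axis=1)     # feature fusion by concatenation
quality = X @ w_true + 0.1 * rng.normal(size=n)

# Ridge regression, closed form: w = (X^T X + lam I)^{-1} X^T y
lam = 1e-2
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(16), X.T @ quality)
pred = X @ w_hat

# Fused features should explain almost all of the score's variance here.
r2 = 1 - (quality - pred).var() / quality.var()
assert r2 > 0.95
```

In the real setting the interesting question is exactly the one the paper measures: how much of the gap to GSV-based features remains once filtered Flickr features are fused with aerial ones.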
Creating Ground-level Views from Satellite Imagery
Many techniques based on statistics or artificial intelligence exist to help classify and identify areas on satellite imagery. This includes land use characteristics such as urban spaces, agricultural lands, forests, etc. However, recreating a ground-level image and perspective from satellite imagery has only recently been developed and is now an active area of research. Such work has the potential not only to classify land more accurately but also to provide a ground-level perspective that indicates how an area differs from, or resembles, other similar classes. One pioneering technique for generating ground-level views from satellite images was developed at the University of California, Merced.
Given a satellite image, machine learning creates the view on the ground
Leonardo da Vinci famously created drawings and paintings that showed a bird's eye view of certain areas of Italy with a level of detail that was not otherwise possible until the invention of photography and flying machines. Indeed, many critics have wondered how he could have imagined these details. But now researchers are working on the inverse problem: given a satellite image of Earth's surface, what does that area look like from the ground? How clear can such an artificial image be? Today we get an answer thanks to the work of Xueqing Deng and colleagues at the University of California, Merced.
- North America > United States > California > Merced County > Merced (0.25)
- Europe > Italy (0.25)